Abstract. Word $W$ is said to encounter word $V$ provided there is a homomorphism $\phi$ mapping letters to nonempty words so that $\phi(V)$ is a substring of $W$. For example, taking $\phi$ such that $\phi(h) = c$ and $\phi(u) = ien$, we see that "science" encounters "huh" since $cienc = \phi(huh)$. The density of $V$ in $W$, $\delta(V, W)$, is the proportion of substrings of $W$ that are homomorphic images of $V$. So the density of "huh" in "science" is $2/\binom{8}{2}$. A word is doubled if every letter that appears in the word appears at least twice.

The dichotomy: Let $V$ be a word over any alphabet, $\Sigma$ a finite alphabet with at least 2 letters, and $W_n \in \Sigma^n$ chosen uniformly at random. Word $V$ is doubled if and only if $\mathbb{E}(\delta(V, W_n)) \to 0$ as $n \to \infty$.

We further explore convergence for nondoubled words and concentration of the limit distribution for doubled words around its mean.


1. Introduction 

Graph densities provide the basis for many recent advances in extremal graph theory and the limit theory of graphs (see Lovász [8]). To see if this paradigm is similarly productive for other discrete structures, we here explore pattern densities in free words. In particular, we consider the asymptotic densities of a fixed pattern in random words as a first step in developing the combinatorial limit theory of free words.

1.1. Definitions. Free words (or simply, words) are elements of the semigroup formed from a nonempty alphabet $\Sigma$ with the binary operation of concatenation, denoted by juxtaposition, and with the empty word $\varepsilon$ as the identity element. The set of all finite words over $\Sigma$ is $\Sigma^*$ and the set of $\Sigma$-words of length $k \in \mathbb{N}$ is $\Sigma^k$. For alphabets $\Gamma$ and $\Sigma$, a homomorphism $\phi : \Gamma^* \to \Sigma^*$ is uniquely defined by a function $\phi : \Gamma \to \Sigma^*$. We call a homomorphism nonerasing provided it is defined by $\phi : \Gamma \to \Sigma^* \setminus \{\varepsilon\}$; that is, no letter maps to $\varepsilon$, the empty word.

Let $V$ and $W$ be words. The length of $W$, denoted $|W|$, is the number of letters in $W$, including multiplicity. Denote with $L(W)$ the set of letters found in $W$ and with $\|W\|$ the number of letter repeats in $W$, so $|W| = |L(W)| + \|W\|$. For example $|banana| = 6$, $L(banana) = \{a, b, n\}$, and $\|banana\| = 3$. $W$ has $\binom{|W|+1}{2}$ substrings, each defined by an ordered pair $(i, j)$ with $0 \le i < j \le |W|$.

Denote with $W[i, j]$ the word found in the $(i, j)$-substring, which consists of $j - i$ consecutive letters of $W$, beginning with the $(i+1)$-th. $V$ is a factor of $W$, denoted $V \le W$, provided $V = W[i, j]$ for some $0 \le i < j \le |W|$; that is, $W = SVT$ for some (possibly empty) words $S$ and $T$. For example, $banana[2, 6] = nana \le banana$.
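The notation above maps directly onto string operations. The following is a minimal sketch, assuming words are Python strings; the function names (`letters`, `repeats`, `factor`, `num_substrings`) are illustrative choices and not part of the paper.

```python
def letters(w: str) -> set:
    """L(W): the set of distinct letters found in W."""
    return set(w)


def repeats(w: str) -> int:
    """||W||: the number of letter repeats, so |W| = |L(W)| + ||W||."""
    return len(w) - len(set(w))


def factor(w: str, i: int, j: int) -> str:
    """W[i, j]: the j - i consecutive letters of W beginning with the (i+1)-th."""
    if not 0 <= i < j <= len(w):
        raise ValueError("need 0 <= i < j <= |W|")
    return w[i:j]  # Python slicing uses the same half-open convention


def num_substrings(w: str) -> int:
    """Number of pairs (i, j) with 0 <= i < j <= |W|, that is, C(|W| + 1, 2)."""
    n = len(w)
    return n * (n + 1) // 2
```

With these helpers, `factor("banana", 2, 6)` reproduces the factor $nana$ from the example, and substrings are counted with multiplicity as index pairs rather than as distinct strings.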

$W$ is an instance of $V$, or $V$-instance, provided there exists a nonerasing homomorphism $\phi$ such that $W = \phi(V)$. (Here $V$ is sometimes referred to as a pattern or pattern word.) For example, $banana$ is an instance of $cool$ using homomorphism $\phi$ defined by $\phi(c) = b$, $\phi(o) = an$, and $\phi(l) = a$. $W$ encounters $V$, denoted $V \preceq W$, provided $W'$ is an instance of $V$ for some factor $W' \le W$. For example $cool \preceq bananasplit$. For $W \ne \varepsilon$, denote with $\delta(V, W)$ the proportion of substrings of $W$ that give instances of $V$. For example, $\delta(xx, banana) = 2/\binom{7}{2}$. $\delta_{\mathrm{sur}}(V, W)$ is the characteristic function for the event that $W$ is an instance of $V$.
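To make the definitions of instance, encounter, and density concrete, here is a small backtracking sketch. It is only an illustration under the assumption that words are Python strings; the names `is_instance`, `density`, and `encounters` are hypothetical, and the search is exponential in the worst case.

```python
from fractions import Fraction


def is_instance(v: str, w: str) -> bool:
    """True if W = phi(V) for some nonerasing homomorphism phi."""

    def extend(vi: int, wi: int, phi: dict) -> bool:
        if vi == len(v):
            return wi == len(w)
        x = v[vi]
        if x in phi:
            # The letter already has an image, which must reappear here.
            img = phi[x]
            return w.startswith(img, wi) and extend(vi + 1, wi + len(img), phi)
        # Try each nonempty image for a new letter, leaving at least one
        # letter of W for every remaining letter of V.
        remaining = len(v) - vi - 1
        for end in range(wi + 1, len(w) - remaining + 1):
            phi[x] = w[wi:end]
            if extend(vi + 1, end, phi):
                return True
            del phi[x]
        return False

    return extend(0, 0, {})


def density(v: str, w: str) -> Fraction:
    """delta(V, W): proportion of substrings W[i, j] that are V-instances."""
    n = len(w)
    hits = sum(is_instance(v, w[i:j]) for i in range(n) for j in range(i + 1, n + 1))
    return Fraction(hits, n * (n + 1) // 2)


def encounters(v: str, w: str) -> bool:
    """True if some factor of W is an instance of V."""
    return density(v, w) > 0
```

For instance, `density("xx", "banana")` counts the two squares $anan$ and $nana$ among the $\binom{7}{2} = 21$ substrings.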

Fix alphabets $\Gamma$ and $\Sigma$. An encounter of $V$ in $W$ is an ordered triple $(a, b, \phi)$ where $W[a, b] = \phi(V)$ for homomorphism $\phi : \Gamma^* \to \Sigma^*$. When $\Gamma = L(V)$ and $W \in \Sigma^*$, denote with $\mathrm{hom}(V, W)$ the number of encounters of $V$ in $W$. For example, $\mathrm{hom}(ab, cde) = 4$ since $cde[0, 2]$ and $cde[1, 3]$ are instances of $ab$, each for one homomorphism $\{a, b\}^* \to \{c, d, e\}^*$, and $cde[0, 3]$ is an instance of $ab$ under two homomorphisms. Note that the conditions on $\Gamma$ and $\Sigma$ are necessary for $\mathrm{hom}(V, W)$ to not be $0$ or $\infty$.
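The encounter count can be checked by brute force. The sketch below assumes $\Gamma = L(V)$, so a homomorphism is determined by the images of the letters of $V$; the names `count_homs` and `hom` are illustrative only.

```python
def count_homs(v: str, w: str) -> int:
    """Number of nonerasing homomorphisms phi on L(V) with phi(V) = W."""

    def extend(vi: int, wi: int, phi: dict) -> int:
        if vi == len(v):
            return 1 if wi == len(w) else 0
        x = v[vi]
        if x in phi:
            img = phi[x]
            if w.startswith(img, wi):
                return extend(vi + 1, wi + len(img), phi)
            return 0
        total = 0
        remaining = len(v) - vi - 1  # letters of V still needing an image
        for end in range(wi + 1, len(w) - remaining + 1):
            phi[x] = w[wi:end]
            total += extend(vi + 1, end, phi)
            del phi[x]
        return total

    return extend(0, 0, {})


def hom(v: str, w: str) -> int:
    """hom(V, W): number of encounters (a, b, phi) of V in W."""
    n = len(w)
    return sum(count_homs(v, w[a:b]) for a in range(n) for b in range(a + 1, n + 1))
```

Counting homomorphisms rather than substrings is what makes $\mathrm{hom}(V, W)$ an overcount of the number of instance substrings: a single substring such as $cde$ can contribute several encounters.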

Fact 1. For fixed words $V$ and $W \ne \varepsilon$,

\[ \binom{|W|+1}{2}\, \delta(V, W) \le \mathrm{hom}(V, W). \]

1.2. Background. Word encounters have primarily been explored from the perspective of avoidance. Word $W$ avoids a (pattern) word $V$ provided $V \not\preceq W$. $V$ is $k$-avoidable provided, from a $k$-letter alphabet, there are infinitely many words that avoid $V$. The premier result on word avoidance is generally considered to be the proof of Thue [10] that the word $aa$ is 3-avoidable but not 2-avoidable. Two seminal papers on avoidability, by Bean, Ehrenfeucht, and McNulty [1] and Zimin [11, 12], include classification of unavoidable words, that is, words that are not $k$-avoidable for any $k$. Recently, the authors [4] and Tao [9] investigated bounds on the length of words that avoid unavoidable words. There remain a number of open problems regarding which words are $k$-avoidable for particular $k$. See Lothaire [7] and Currie [6] for surveys on avoidability results and Blanchet-Sadri and Woodhouse [3] for recent work on 3-avoidability.

A word is doubled provided every letter in the word occurs at least twice. Otherwise, if there is a letter that occurs exactly once, we say the word is nondoubled. Every doubled word is $k$-avoidable for some $k > 1$ [7]. For a doubled word $V$ with $k \ge 2$ distinct letters and an alphabet $\Sigma$ with $|\Sigma| = q \ge 4$, $(k, q) \ne (2, 4)$, Bell and Goh [2] showed that there are at least $\lambda(k, q)^n$ words in $\Sigma^n$ that avoid $V$, where

\[ \lambda(k, q) = q\left(1 + \frac{1}{(q-2)^k}\right)^{-1}. \]

This exponential lower bound on the number of words avoiding a doubled word hints at the moral of the present work: instances of doubled words are rare. For a doubled word $V$ and an alphabet $\Sigma$ with at least 2 letters, the probability that a random word $W_n \in \Sigma^n$ avoids $V$ is asymptotically 0. Indeed, the event that $W_n[b|V|, (b+1)|V|]$ is an instance of $V$ has nonzero probability and is independent for distinct $b$. Nevertheless, $\delta(V, W_n)$, the proportion of substrings of $W_n$ that are instances of $V$, is asymptotically negligible.




2. The Dichotomy 

In this section, we establish a density-motivated bipartition of all free words into doubled and 
nondoubled words. From there, we present a more detailed analysis of the asymptotic densities in 
these two classes. 

Theorem 2. Let $V$ be a word on any alphabet. Fix an alphabet $\Sigma$ with $q \ge 2$ letters, and let $W_n \in \Sigma^n$ be chosen uniformly at random. The following are equivalent:

(i) $V$ is doubled (that is, every letter in $V$ occurs at least twice);

(ii) $\lim_{n \to \infty} \mathbb{E}(\delta(V, W_n)) = 0$.

Proof. First we prove (i) $\Rightarrow$ (ii). Note that in $W_n$, there are in expectation the same number of encounters of $V$ as there are of any anagram of $V$. Indeed, if $V'$ is an anagram of $V$ and $\phi$ is a nonerasing homomorphism, then $|\phi(V')| = |\phi(V)|$.

Fact 3. If $V'$ is an anagram of $V$, then $\mathbb{E}(\mathrm{hom}(V, W_n)) = \mathbb{E}(\mathrm{hom}(V', W_n))$.

Assume $V$ is doubled and let $\Gamma = L(V)$ and $k = |\Gamma|$. Given Fact 3, we consider an anagram $V' = XY$ of $V$, where $|X| = k$ and $\Gamma = L(X) = L(V)$. That is, $X$ comprises one copy of each letter in $\Gamma$ and all the duplicate letters of $V$ are in $Y$.

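We obtain an upper bound for the average density of $V$ by estimating $\mathbb{E}(\mathrm{hom}(V', W_n))$. To do so, sum over starting position $i$ and length $j$ of encounters of $X$ in $W_n$ that might extend to an encounter of $V'$. There are at most $\binom{j+1}{k+1}$ homomorphisms $\phi$ that map $X$ to $W_n[i, i+j]$ and the probability that $W_n[i+j, i+j+|\phi(Y)|] = \phi(Y)$ is at most $q^{-j}$. Also, the series $\sum_{j=k}^{\infty} \binom{j+1}{k+1} q^{-j}$ converges (try the ratio test) to some $c$ not dependent on $n$.

\[
\begin{aligned}
\mathbb{E}(\delta(V, W_n)) &\le \frac{1}{\binom{n+1}{2}}\, \mathbb{E}(\mathrm{hom}(V', W_n)) \\
&< \frac{1}{\binom{n+1}{2}} \sum_{i=0}^{n-|V|} \sum_{j \ge k} \binom{j+1}{k+1} q^{-j} \\
&< \frac{c\,(n - |V| + 1)}{\binom{n+1}{2}} \\
&= O(n^{-1}).
\end{aligned}
\]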


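We prove (ii) $\Rightarrow$ (i) by contraposition. Assume there is a letter $x$ that occurs exactly once in $V$. Write $V = TxU$ where $L(V) \setminus L(TU) = \{x\}$. We obtain a lower bound for $\mathbb{E}(\delta(V, W_n))$ by only counting encounters with $|\phi(TU)| = |TU|$. Note that each such encounter is unique to its instance, preventing double-counting. For this undercount, we sum over encounters with $W_n[i-j, i] = \phi(x)$; each occurs with probability $q^{-\|TU\|}$, since only the repeated letters of $TU$ constrain $W_n$.

\[
\begin{aligned}
\mathbb{E}(\delta(V, W_n)) &= \mathbb{E}(\delta(TxU, W_n)) \\
&\ge \frac{1}{\binom{n+1}{2}} \sum_{i=|T|}^{n-|U|-1} \sum_{j=1}^{i-|T|} q^{-\|TU\|} \\
&= \frac{q^{-\|TU\|}}{\binom{n+1}{2}} \sum_{i=|T|}^{n-|U|-1} (i - |T|) \\
&= q^{-\|TU\|}\, \frac{\binom{n-|TU|}{2}}{\binom{n+1}{2}} \\
&\sim q^{-\|TU\|} > 0.
\end{aligned}
\]

□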
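For very small $n$ the expected density in Theorem 2 can be computed exactly by enumerating $\Sigma^n$. The sketch below is only an illustration: it swaps in a regular expression with backreferences as the instance test (each new pattern letter becomes a group `(.+)`, each repeat a backreference), which is a convenience of this example and not the method of the proof. The names `instance_pattern` and `expected_density` are hypothetical.

```python
import re
from fractions import Fraction
from itertools import product


def instance_pattern(v: str) -> "re.Pattern":
    """Regex matching exactly the instances of V (nonempty image per letter)."""
    group = {}
    parts = []
    for x in v:
        if x in group:
            parts.append(f"(?:\\{group[x]})")  # repeat of an earlier image
        else:
            group[x] = len(group) + 1
            parts.append("(.+)")  # nonerasing: at least one letter
    return re.compile("".join(parts))


def expected_density(v: str, n: int, alphabet: str = "ab") -> Fraction:
    """Exact E(delta(V, W_n)) for W_n uniform over alphabet^n (small n only)."""
    pat = instance_pattern(v)
    pairs = n * (n + 1) // 2
    hits = 0
    for tup in product(alphabet, repeat=n):
        w = "".join(tup)
        hits += sum(
            1
            for i in range(n)
            for j in range(i + 1, n + 1)
            if pat.fullmatch(w[i:j])
        )
    return Fraction(hits, pairs * len(alphabet) ** n)
```

Running this for a doubled pattern such as `"xx"` and a nondoubled pattern such as `"xyx"` over a range of small $n$ gives a feel for the dichotomy, although the asymptotic behavior only becomes visible for $n$ well beyond what enumeration can reach.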


It behooves us now to develop more precise theory for these two classes of words: doubled and nondoubled. Lemma 5 below both helps develop that theory and gives insight into the detrimental effect that letter repetition has on encounter frequency.

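Fact 4. For $\bar r = \{r_1, \ldots, r_k\} \in (\mathbb{Z}^+)^k$ and $d = \gcd_{i \in [k]}(r_i)$, there exists an integer $N = N_{\bar r}$ such that for every $n > N$ there exist coefficients $a_1, \ldots, a_k \in \mathbb{Z}^+$ such that $dn = \sum_{i=1}^{k} a_i r_i$ and $a_i \le N$ for $i \ge 2$.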
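Fact 4 can be explored numerically with a direct search. The following sketch assumes a candidate cap on $a_2, \ldots, a_k$ is supplied by the caller; the function name `represent` and its parameters are illustrative, and the search does not compute the constant $N$ of the statement.

```python
from itertools import product


def represent(r, target, cap):
    """Find positive integers a_1..a_k with sum(a_i * r_i) == target and
    a_i <= cap for i >= 2. Returns a tuple, or None if no such choice exists."""
    for tail in product(range(1, cap + 1), repeat=len(r) - 1):
        rest = target - sum(a * ri for a, ri in zip(tail, r[1:]))
        # The remainder must be a positive multiple of r_1.
        if rest > 0 and rest % r[0] == 0:
            return (rest // r[0],) + tail
    return None
```

For $\bar r = (2, 3)$, so $d = 1$, capping $a_2$ at 2 already suffices for every target from 7 onward: odd targets use $a_2 = 1$ and even targets use $a_2 = 2$.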

Lemma 5. For any word $V$, let $\Gamma = L(V) = \{x_1, \ldots, x_k\}$ where $x_i$ has multiplicity $r_i$ for each $i \in [k]$. Let $U$ be $V$ with all letters of multiplicity $r = \min_{i \in [k]}(r_i)$ removed. Finally, let $\Sigma$ be any finite alphabet with $|\Sigma| = q \ge 2$ letters. Then for a uniformly randomly chosen $V$-instance $W \in \Sigma^{dn}$, where $d = \gcd_{i \in [k]}(r_i)$, there is asymptotically almost surely a homomorphism $\phi : \Gamma^* \to \Sigma^*$ with $\phi(V) = W$ and $|\phi(U)| < \sqrt{dn}$.

Proof. Let $a_n$ be the number of $V$-instances in $\Sigma^n$ and $b_n$ be the number of homomorphisms $\phi : \Gamma^* \to \Sigma^*$ such that $|\phi(V)| = n$. Let $b_n^1$ be the number of these $\phi$ such that $|\phi(U)| < \sqrt{n}$ and $b_n^2$ the number of all other $\phi$, so that $b_n = b_n^1 + b_n^2$. Similarly, let $a_n^1$ be the number of $V$-instances in $\Sigma^n$ for which there exists a $\phi$ counted by $b_n^1$ and $a_n^2$ the number of instances with no such $\phi$, so $a_n = a_n^1 + a_n^2$. Observe that $a_n^2 \le b_n^2$.

Without loss of generality, assume $r_1 = r$ (rearrange the $x_i$ if not). We now utilize $N = N_{\bar r}$ from Fact 4. For sufficiently large $n$, we can undercount $a_{dn}^1$ by counting homomorphisms $\phi$ with $|\phi(x_i)| = a_i$ for the $a_i$ attained from Fact 4. Indeed, distinct homomorphisms with the same image-length for every letter in $V$ produce distinct $V$-instances. Hence

\[
a_{dn}^1 \ge q^{\sum_{i=1}^{k} a_i} \ge q^{\frac{1}{r}\left(dn - N\sum_{i=2}^{k} r_i\right) + (k-1)} = c\, q^{dn/r},
\]

where $c = q^{(k-1) - \frac{N}{r}\sum_{i=2}^{k} r_i}$ depends on $V$ but not on $n$. To overcount $b_n^2$ (and $a_n^2$ by extension), we consider all ways, at most $\binom{n+1}{k+1}$ of them, to partition an $n$-letter length and so determine the lengths of the images of the letters in $V$. However, for letters with multiplicity strictly greater than $r$, the sum of the lengths of their images must be at least $\sqrt{n}$.

\[
b_n^2 \le \binom{n+1}{k+1}\, q^{\frac{n - \sqrt{n}}{r} + \frac{\sqrt{n}}{r+1}} = q^{n/r}\, o(1).
\]

\[
a_{dn}^2 \le b_{dn}^2 = o(a_{dn}^1).
\]

That is, the proportion of $V$-instances of length $dn$ that cannot be expressed with $|\phi(U)| < \sqrt{dn}$ diminishes to 0 as $n$ grows. □


3. Density of Nondoubled Words 

In Theorem 2, we showed that the density of nondoubled $V$ in long random words (over a fixed alphabet with at least two letters) does not approach 0. The natural follow-up question is: Does the density converge? To answer this question, we first prove the following lemma. Fixing $V = TxU$ where $x$ is a nonrecurring letter in $V$, the lemma tells us that all but a diminishing proportion of $V$-instances can be obtained by some $\phi$ with $|\phi(TU)|$ negligible.

Lemma 6. Let $V = U_0 x_1 U_1 x_2 \cdots x_r U_r$ with $r \ge 1$, where $U = U_0 U_1 \cdots U_r$ is doubled with $k$ distinct letters (though any particular $U_j$ may be the empty word), the $x_i$ are distinct, and no $x_i$ occurs in $U$. Further, let $\Gamma$ be the $(k + r)$-letter alphabet of $V$ and let $\Sigma$ be any finite alphabet with $q \ge 2$ letters. Then there exists a nondecreasing function $g(n) = o(n)$ such that, for a randomly chosen $V$-instance $W \in \Sigma^n$, there is asymptotically almost surely a homomorphism $\phi : \Gamma^* \to \Sigma^*$ with $\phi(V) = W$ and $|\phi(x_r)| > n - g(n)$.

Proof. Let $X_i = x_1 x_2 \cdots x_i$ for $0 \le i \le r$ (so $X_0 = \varepsilon$). For any word $W$, let $\Phi_W$ be the set of homomorphisms $\{\phi : \Gamma^* \to \Sigma^* \mid \phi(V) = W\}$ that map $V$ onto $W$. Define $P_i$ to be the following proposition for $i \in [r]$:

There exists a nondecreasing function $f_i(n) = o(n)$ such that, for a randomly chosen $V$-instance $W \in \Sigma^n$, there is asymptotically almost surely a homomorphism $\phi \in \Phi_W$ such that $|\phi(U X_{i-1})| \le f_i(n)$.

The conclusion of this lemma is an immediate consequence of $P_r$, with $g(n) = f_r(n)$, which we will prove by induction. Lemma 5 provides the base case, with $r = 1$ and $f_1(n) = \sqrt{n}$.

Let us prove the inductive step: $P_i$ implies $P_{i+1}$ for $i \in [r - 1]$. Roughly speaking, this says: If most instances of $V$ can be made with a homomorphism $\phi$ where $|\phi(U X_{i-1})|$ is negligible, then most instances of $V$ can be made with a homomorphism $\phi$ where $|\phi(U X_i)|$ is negligible.

Assume $P_i$ for some $i \in [r - 1]$, and set $f(n) = f_i(n)$. Let $A_n$ be the set of $V$-instances $W$ in $\Sigma^n$ such that $|\phi(U X_{i-1})| \le f(n)$ for some $\phi \in \Phi_W$. Let $B_n$ be the set of all other $V$-instances in $\Sigma^n$. $P_i$ implies $|B_n| = o(|A_n|)$.

Case 1: $U_i = \varepsilon$, so $x_i$ and $x_{i+1}$ are consecutive in $V$. When $|\phi(U X_{i-1})| \le f(n)$, we can define $\psi$ so that $\psi(x_i x_{i+1}) = \phi(x_i x_{i+1})$ and $|\psi(x_i)| = 1$; otherwise, let $\psi(y) = \phi(y)$ for $y \in \Gamma \setminus \{x_i, x_{i+1}\}$. Then $|\psi(U X_i)| \le f(n) + 1$ and $P_{i+1}$ holds with $f_{i+1}(n) = f_i(n) + 1$.

Case 2: $U_i \ne \varepsilon$, so $|U_i| > 0$. Let $g(n)$ be some nondecreasing function such that $f(n) = o(g(n))$ and $g(n) = o(n)$. (This will be the $f_{i+1}$ for $P_{i+1}$.) Let $A_n^{\alpha}$ consist of $W \in A_n$ such that $|\phi(U X_i)| \le g(n)$ for some $\phi \in \Phi_W$. Let $A_n^{\beta} = A_n \setminus A_n^{\alpha}$. The objective henceforth is to show that $|A_n^{\beta}| = o(|A_n^{\alpha}|)$.

For $Y \in A_n^{\beta}$, let $\Phi_Y^{\beta}$ be the set of homomorphisms $\{\phi \in \Phi_Y : |\phi(U X_{i-1})| \le f(n)\}$ that disqualify $Y$ from being in $B_n$. Hence $Y \in A_n$ implies $\Phi_Y^{\beta} \ne \emptyset$. Since $Y \notin A_n^{\alpha}$, $\phi \in \Phi_Y^{\beta}$ implies $|\phi(U X_i)| > g(n)$, so $|\phi(x_i)| > g(n) - f(n)$. Pick $\phi_Y \in \Phi_Y^{\beta}$ as follows:

• Primarily, minimize $|\phi(U_0 x_1 U_1 x_2 \cdots U_{i-1} x_i)|$;

• Secondarily, minimize $|\phi(U_i)|$;

• Tertiarily, minimize $|\phi(U_0 x_1 U_1 x_2 \cdots U_{i-1})|$.

Roughly speaking, we have chosen $\phi_Y$ to move the image of $U_i$ as far left as possible in $Y$. But since $Y \notin A_n^{\alpha}$, we want it further left!

To suppress the details we no longer need, let $Y = Y_1\, \phi_Y(x_i)\, \phi_Y(U_i)\, \phi_Y(x_{i+1})\, Y_2$, where $Y_1 = \phi_Y(U_0 x_1 U_1 x_2 \cdots U_{i-1})$ and $Y_2 = \phi_Y(U_{i+1} x_{i+2} \cdots U_r)$.

Consider a word $Z \in \Sigma^n$ of the form $Y_1\, Z_1\, \phi_Y(U_i)\, Z_2\, \phi_Y(U_i)\, \phi_Y(x_{i+1})\, Y_2$, where $Z_1$ is an initial string of $\phi_Y(x_i)$ with $2f(n) \le |Z_1| < g(n) - 2f(n)$ and $Z_2$ is a final string of $\phi_Y(x_i)$. (See Figure 1.) In a sense, the image of $x_i$ was too long, so we replace a leftward substring with a copy of the image of $U_i$. Let $C_Y$ be the set of all such $Z$ with $|Z_1|$ a multiple of $f(n)$. For every $Z \in C_Y$ we can see that $Z \in A_n^{\alpha}$, by defining $\psi \in \Phi_Z$ as follows:

\[
\psi(y) =
\begin{cases}
Z_1 & \text{if } y = x_i; \\
Z_2\, \phi_Y(U_i)\, \phi_Y(x_{i+1}) & \text{if } y = x_{i+1}; \\
\phi_Y(y) & \text{otherwise.}
\end{cases}
\]

Y = | Y_1 | φ_Y(x_i)                | φ_Y(U_i) | φ_Y(x_{i+1}) | Y_2 |
Z = | Y_1 | Z_1 | φ_Y(U_i) | Z_2    | φ_Y(U_i) | φ_Y(x_{i+1}) | Y_2 |
          | ψ(x_i) | ψ(U_i) | ψ(x_{i+1})                      |

Figure 1. Replacing a section of $\phi_Y(x_i)$ in $Y$ to create $Z$.


Claim 1: $\liminf_{|Y| = n \to \infty} |C_Y| = \infty$.

Since we want $2f(n) \le |Z_1| < g(n) - 2f(n)$, and $g(n) - 2f(n) < |\phi_Y(x_i)| - |\phi_Y(U_i)|$, there are $g(n) - 4f(n)$ places to put the copy of $\phi_Y(U_i)$. To avoid any double-counting that might occur when some $Z$ and $Z'$ have their new copies of $\phi_Y(U_i)$ in overlapping locations, we further required that $f(n)$ divide $|Z_1|$. This produces the following lower bound:

\[
|C_Y| \ge \left\lfloor \frac{g(n) - 4f(n)}{f(n)} \right\rfloor \to \infty.
\]


Claim 2: For distinct $Y, Y' \in A_n^{\beta}$, $C_Y \cap C_{Y'} = \emptyset$.

To prove Claim 2, take $Y, Y' \in A_n^{\beta}$ with $Z \in C_Y \cap C_{Y'}$. Define $Y_1 = \phi_Y(U_0 x_1 U_1 x_2 \cdots U_{i-1})$ and $Y_2 = \phi_Y(U_{i+1} x_{i+2} \cdots U_r)$ as before and $Y_1' = \phi_{Y'}(U_0 x_1 U_1 x_2 \cdots U_{i-1})$ and $Y_2' = \phi_{Y'}(U_{i+1} x_{i+2} \cdots U_r)$. Now for some $Z_1, Z_1', Z_2, Z_2'$,

\[
Y_1\, Z_1\, \phi_Y(U_i)\, Z_2\, \phi_Y(U_i)\, \phi_Y(x_{i+1})\, Y_2 = Z = Y_1'\, Z_1'\, \phi_{Y'}(U_i)\, Z_2'\, \phi_{Y'}(U_i)\, \phi_{Y'}(x_{i+1})\, Y_2',
\]

with the following constraints:

(i) $|Y_1\, \phi_Y(U_i)| \le |\phi_Y(U X_{i-1})| \le f(n)$;

(ii) $|Y_1'\, \phi_{Y'}(U_i)| \le |\phi_{Y'}(U X_{i-1})| \le f(n)$;

(iii) $2f(n) \le |Z_1| < g(n) - 2f(n)$;

(iv) $2f(n) \le |Z_1'| < g(n) - 2f(n)$;

(v) $|Z_1\, \phi_Y(U_i)\, Z_2| = |\phi_Y(x_i)| > g(n) - f(n)$;

(vi) $|Z_1'\, \phi_{Y'}(U_i)\, Z_2'| = |\phi_{Y'}(x_i)| > g(n) - f(n)$.

As a consequence:

• $|Y_1\, Z_1\, \phi_Y(U_i)| < g(n) - f(n) < |Z_1'\, \phi_{Y'}(U_i)\, Z_2'|$, by (i), (iii), and (vi);

• $|Y_1 Z_1| \ge |Z_1| \ge 2f(n) > |Y_1'|$, by (iii) and (ii).

Therefore, the copy of $\phi_Y(U_i)$ added to $Z$ is properly within the noted occurrence of $Z_1'\, \phi_{Y'}(U_i)\, Z_2'$ in $Z$, which is in the place of $\phi_{Y'}(x_i)$ in $Y'$. In particular, the added copy of $\phi_Y(U_i)$ in $Z$ interferes with neither $Y_1'$ nor the original copy of $\phi_{Y'}(U_i)$. Thus $Y_1'$ is an initial substring of $Y$ and $\phi_{Y'}(U_i)\, \phi_{Y'}(x_{i+1})\, Y_2'$ is a final substring of $Y$. Likewise, $Y_1$ is an initial substring of $Y'$ and $\phi_Y(U_i)\, \phi_Y(x_{i+1})\, Y_2$ is a final substring of $Y'$. By the selection process of $\phi_Y$ and $\phi_{Y'}$, we know that $Y_1 = Y_1'$ and $\phi_Y(U_i)\, \phi_Y(x_{i+1})\, Y_2 = \phi_{Y'}(U_i)\, \phi_{Y'}(x_{i+1})\, Y_2'$. Finally, since $f(n)$ divides $|Z_1|$ and $|Z_1'|$, we deduce that $Z_1 = Z_1'$. Otherwise, the added copies of $\phi_Y(U_i)$ and of $\phi_{Y'}(U_i)$ in $Z$ would not overlap, resulting in a contradiction to the selection of $\phi_Y$ and $\phi_{Y'}$. Therefore, $Y = Y'$, concluding the proof of Claim 2.

Now $C_Y \subseteq A_n^{\alpha}$ for $Y \in A_n^{\beta}$. Claim 1 and Claim 2 together imply that $|A_n^{\beta}| = o(|A_n^{\alpha}|)$.

□


Observe that the choice of $\sqrt{n}$ in Lemma 5 was arbitrary. The proof works for any function $f(n) = o(n)$ with $f(n) \to \infty$. Therefore, where Lemma 6 claims the existence of some $g(n) \to \infty$, the statement is in fact true for all $g(n) \to \infty$.

Let $\mathbb{I}_n(V, \Sigma)$ be the probability that a uniformly randomly selected length-$n$ $\Sigma$-word is an instance of $V$. That is,

\[
\mathbb{I}_n(V, \Sigma) = \frac{\left|\{W \in \Sigma^n \mid \phi(V) = W \text{ for some homomorphism } \phi : L(V)^* \to \Sigma^*\}\right|}{|\Sigma|^n}.
\]

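Fact 7. For any $V$ and $\Sigma$ and for $W_n \in \Sigma^n$ chosen uniformly at random,

\[
\begin{aligned}
\binom{n+1}{2}\, \mathbb{E}(\delta(V, W_n)) &= \sum_{m=1}^{n} (n + 1 - m)\, \mathbb{E}(\delta_{\mathrm{sur}}(V, W_m)) \\
&= \sum_{m=1}^{n} (n + 1 - m)\, \mathbb{I}_m(V, \Sigma).
\end{aligned}
\]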
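The identity in Fact 7 is linearity of expectation over the $n + 1 - m$ substrings of each length $m$, and it can be confirmed exactly for small cases. The sketch below enumerates a small alphabet and uses a backreference regex as the instance test; this is an illustrative shortcut, and the names `instance_pattern`, `instance_probability`, `expected_instance_count`, and `fact7_right_side` are hypothetical.

```python
import re
from fractions import Fraction
from itertools import product


def instance_pattern(v: str) -> "re.Pattern":
    """Regex matching exactly the instances of V."""
    group, parts = {}, []
    for x in v:
        if x in group:
            parts.append(f"(?:\\{group[x]})")
        else:
            group[x] = len(group) + 1
            parts.append("(.+)")
    return re.compile("".join(parts))


def instance_probability(v: str, m: int, alphabet: str) -> Fraction:
    """I_m(V, Sigma): probability that a uniform length-m word is a V-instance."""
    pat = instance_pattern(v)
    hits = sum(1 for t in product(alphabet, repeat=m) if pat.fullmatch("".join(t)))
    return Fraction(hits, len(alphabet) ** m)


def expected_instance_count(v: str, n: int, alphabet: str) -> Fraction:
    """Left side of Fact 7: C(n+1, 2) * E(delta(V, W_n))."""
    pat = instance_pattern(v)
    total = 0
    for t in product(alphabet, repeat=n):
        w = "".join(t)
        total += sum(
            1 for i in range(n) for j in range(i + 1, n + 1) if pat.fullmatch(w[i:j])
        )
    return Fraction(total, len(alphabet) ** n)


def fact7_right_side(v: str, n: int, alphabet: str) -> Fraction:
    """Right side of Fact 7: sum over m of (n + 1 - m) * I_m(V, Sigma)."""
    return sum(
        (n + 1 - m) * instance_probability(v, m, alphabet) for m in range(1, n + 1)
    )
```

Exact rational arithmetic via `Fraction` avoids any floating-point doubt when comparing the two sides.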


Denote $\mathbb{I}(V, \Sigma) = \lim_{n \to \infty} \mathbb{I}_n(V, \Sigma)$. When does this limit exist?

Theorem 8. For nondoubled $V$ and alphabet $\Sigma$, $\mathbb{I}(V, \Sigma)$ exists. Moreover, $\mathbb{I}(V, \Sigma) > 0$.

Proof. If $|\Sigma| = 1$, then $\mathbb{I}_n(V, \Sigma) = 1$ for $n \ge |V|$.

Assume $|\Sigma| = q \ge 2$. Let $V = TxU$ where $x$ is the right-most nonrecurring letter in $V$ (so $x = x_r$ in the notation of Lemma 6). Let $\Gamma = L(V)$ be the alphabet of letters in $V$. By Lemma 6, there is a nondecreasing function $g(n) = o(n)$ such that, for a randomly chosen $V$-instance $W \in \Sigma^n$, there is asymptotically almost surely a homomorphism $\phi : \Gamma^* \to \Sigma^*$ with $\phi(V) = W$ and $|\phi(x_r)| > n - g(n)$.

Let $a_n$ be the number of $W \in \Sigma^n$ such that there exists $\phi : \Gamma^* \to \Sigma^*$ with $\phi(V) = W$ and $|\phi(x_r)| > n - g(n)$. Lemma 6 tells us that $a_n / q^n \sim \mathbb{I}_n(V, \Sigma)$. Note that $a_n / q^n$ is bounded. It suffices to show that $a_{n+1} \ge q\, a_n$ for sufficiently large $n$. Pick $n$ so that $g(n) < n/3$.

For length-$n$ $V$-instance $W$ counted by $a_n$, let $\phi_W$ be a homomorphism that maximizes $|\phi_W(x_r)|$ and, of such, minimizes $|\phi_W(T)|$. For each $\phi_W$ and each $a \in \Sigma$, let $\phi_W^a$ be the function such that, if $\phi_W(x_r) = AB$ with $|A| = \lfloor |\phi_W(x_r)|/2 \rfloor$, then $\phi_W^a(x) = AaB$; $\phi_W^a(y) = \phi_W(y)$ for each $y \in \Gamma \setminus \{x\}$.

Roughly speaking, we are inserting $a$ into the middle of the image of $x$.

Suppose we are double-counting, so $\phi_W^a(V) = \phi_Y^b(V)$. As

\[
|\phi_W(x_r)|/2 > (n - g(n))/2 > n/3 > g(n) > |\phi_Y(TU)|
\]

and vice-versa, the inserted $a$ (resp., $b$) of one map does not appear in the image of $TU$ under the other map. So $\phi_W(T)$ is an initial string and $\phi_W(U)$ a final string of $\phi_Y^b(V)$, and vice-versa. By the selection criteria of $\phi_W$ and $\phi_Y$, $|\phi_W(T)| = |\phi_Y(T)|$ and $|\phi_W(U)| = |\phi_Y(U)|$. Therefore the location of the added $a$ in $\phi_W^a(V)$ and the added $b$ in $\phi_Y^b(V)$ are the same. Hence, $a = b$ and $W = Y$.

Moreover $\mathbb{I}(V, \Sigma) \ge q^{-\|V\|} > 0$, since for $n \ge |V|$ mapping each letter of $TU$ to a single letter already gives $\mathbb{I}_n(V, \Sigma) \ge q^{-\|TU\|} = q^{-\|V\|}$. □

Example 9. Let $V = x_1 x_2 \cdots x_k$ have $k$ distinct letters. Since every word of length at least $k$ is a $V$-instance, $\mathbb{I}(V, \Sigma) = 1$ for every alphabet $\Sigma$. When even one letter in $V$ is repeated, finding $\mathbb{I}(V, \Sigma)$ becomes a nontrivial task.

Example 10. Zimin's classification of unavoidable words is as follows [11, 12]: Every unavoidable word with $n$ distinct letters is encountered by $Z_n$, where $Z_0 = \varepsilon$ and $Z_{i+1} = Z_i x_{i+1} Z_i$ with $x_{i+1}$ a letter not occurring in $Z_i$. For example, $Z_2 = aba$ and $Z_3 = abacaba$. The authors can calculate $\mathbb{I}(Z_2, \Sigma)$ and $\mathbb{I}(Z_3, \Sigma)$ to arbitrary precision.

Table 1. $\mathbb{I}(Z_2, \Sigma)$ and $\mathbb{I}(Z_3, \Sigma)$ computed to 7 decimal places.

|Σ|          2          3          4          5          6          7
I(Z_2, Σ)    0.7322132  0.4430202  0.3122520  0.2399355  0.1944229  0.1632568
I(Z_3, Σ)    0.1194437  0.0183514  0.0051925  0.0019974  0.0009253  0.0004857

Corollary 11. Let $V$ be a nondoubled word on any alphabet. Fix an alphabet $\Sigma$, and let $W_n \in \Sigma^n$ be chosen uniformly at random. Then

\[
\lim_{n \to \infty} \mathbb{E}(\delta(V, W_n)) = \mathbb{I}(V, \Sigma).
\]

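Proof. Let $\mathbb{I} = \mathbb{I}(V, \Sigma)$ and $\epsilon > 0$. Pick $N = N_\epsilon$ sufficiently large so $|\mathbb{I} - \mathbb{I}_n(V, \Sigma)| < \frac{\epsilon}{2}$ when $n > N$. Applying Fact 7 for $n > \max(N, 4N/\epsilon)$,

\[
\begin{aligned}
|\mathbb{I} - \mathbb{E}(\delta(V, W_n))| &= \left| \frac{1}{\binom{n+1}{2}} \sum_{m=1}^{n} (n + 1 - m)\, \mathbb{I} - \frac{1}{\binom{n+1}{2}} \sum_{m=1}^{n} (n + 1 - m)\, \mathbb{I}_m(V, \Sigma) \right| \\
&\le \frac{1}{\binom{n+1}{2}} \sum_{m=1}^{n} (n + 1 - m)\, |\mathbb{I} - \mathbb{I}_m(V, \Sigma)| \\
&= \frac{1}{\binom{n+1}{2}} \left[ \sum_{m=1}^{N} + \sum_{m=N+1}^{n} \right] (n + 1 - m)\, |\mathbb{I} - \mathbb{I}_m(V, \Sigma)| \\
&< \frac{1}{\binom{n+1}{2}} \left[ \sum_{m=1}^{\lfloor \epsilon n/4 \rfloor} (n + 1 - m) \cdot 1 + \sum_{m=1}^{n} (n + 1 - m)\, \frac{\epsilon}{2} \right] \\
&< \frac{1}{\binom{n+1}{2}} \cdot \frac{\epsilon n}{4} \cdot n + \frac{\epsilon}{2} \\
&< \epsilon.
\end{aligned}
\]

□


4. Concentration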


For doubled $V$ and $|\Sigma| > 1$, we established that the expectation of the density $\delta(V, W_n)$ converges to zero. In particular, we know the following.

Proposition 12. Let $V$ be a doubled word, $\Sigma$ an alphabet with $q \ge 2$ letters, and $W_n \in \Sigma^n$ chosen uniformly at random. Then

\[
\mathbb{E}(\delta(V, W_n)) = \Theta\!\left(\frac{1}{n}\right).
\]

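Proof. In the proof of Theorem 2, we showed that

\[
\mathbb{E}(\delta(V, W_n)) < \frac{\left(\sum_{j=k}^{\infty} \binom{j+1}{k+1} q^{-j}\right)(n - |V| + 1)}{\binom{n+1}{2}} = O(n^{-1}).
\]

The lower bound follows from an observation made in the Background section: "the event that $W_n[b|V|, (b+1)|V|]$ is an instance of $V$ has nonzero probability and is independent for distinct $b$." Hence

\[
\mathbb{E}(\delta(V, W_n)) \ge \frac{1}{\binom{n+1}{2}} \left\lfloor \frac{n}{|V|} \right\rfloor \mathbb{I}_{|V|}(V, \Sigma) = \Omega(n^{-1}).
\]

□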


To bound variance and other higher order moments, we observe the following upper bound on $\mathbb{I}_n(V, \Sigma)$. Henceforth, if $\binom{x}{y}$ is used with nonintegral $x$, we mean

\[
\binom{x}{y} = \frac{\prod_{i=0}^{y-1} (x - i)}{y!}.
\]

Lemma 13. Let $V$ be a doubled word with exactly $k$ letters and $\Sigma$ an alphabet with $q \ge 2$ letters. Moreover, let $L(V) = \{x_1, \ldots, x_k\}$ with $r_i$ the multiplicity of $x_i$ in $V$ for each $i \in [k]$, $d = \gcd_{i \in [k]}(r_i)$, and $r = \min_{i \in [k]}(r_i)$. Then,

\[
\mathbb{I}_n(V, \Sigma) \le \binom{n/d + k + 1}{k + 1}\, q^{n(1-r)/r}.
\]

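Proof. Let $a_n(\bar r)$ be the number of $k$-tuples $\bar a = (a_1, \ldots, a_k) \in (\mathbb{Z}^+)^k$ so that $\sum_{i=1}^{k} a_i r_i = n$. Then $a_n(\bar r) \le \binom{n/d + k + 1}{k + 1}$. Indeed, if $d \nmid n$, then $a_n(\bar r) = 0$. Otherwise, for each $\bar a$ counted by $a_n(\bar r)$, there is a unique corresponding $\bar b \in (\mathbb{Z}^+)^k$ such that $1 \le b_1 < b_2 < \cdots < b_k = n/d$ and $b_j = \frac{1}{d} \sum_{i=1}^{j} a_i r_i$. The number of strictly increasing $k$-tuples of positive integers with largest value $n/d$ is $\binom{n/d - 1}{k - 1}$, which is at most $\binom{n/d + k + 1}{k + 1}$.

Let $W_n \in \Sigma^n$ be chosen uniformly at random. Note that $q^n\, \mathbb{I}_n(V, \Sigma)$ is the number of instances of $V$ in $\Sigma^n$. Each instance is the image of $V$ under at least one homomorphism $\phi$ with $|\phi(x_i)| = a_i$ for some $\bar a$ counted by $a_n(\bar r)$, and for each such $\bar a$ there are $q^{\sum_i a_i} \le q^{n/r}$ homomorphisms. Thus,

\[
q^n\, \mathbb{I}_n(V, \Sigma) \le a_n(\bar r)\, q^{n/r} \le \binom{n/d + k + 1}{k + 1}\, q^{n/r}.
\]

□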
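The counting argument behind Lemma 13 (choose image lengths $a_i$ with $\sum_i a_i r_i = n$, then choose the images) can be turned into a small enumeration that lists every $V$-instance of length $n$ and compares $\mathbb{I}_n(V, \Sigma)$ with the stated bound. This is a sketch for tiny cases only; the names `gen_binom`, `instances`, and `lemma13_bound` are illustrative, and the bound is returned as a float because $q^{n(1-r)/r}$ need not be rational.

```python
from collections import Counter
from fractions import Fraction
from functools import reduce
from itertools import product
from math import factorial, gcd


def gen_binom(x, y: int) -> Fraction:
    """Binomial coefficient with possibly nonintegral top: prod(x - i) / y!."""
    num = Fraction(1)
    for i in range(y):
        num *= Fraction(x) - i
    return num / factorial(y)


def instances(v: str, n: int, alphabet: str) -> set:
    """All V-instances of length n, found by enumerating homomorphisms."""
    mult = Counter(v)
    order = sorted(mult)
    found = set()

    def assign(idx: int, remaining: int, phi: dict) -> None:
        if idx == len(order):
            if remaining == 0:
                found.add("".join(phi[x] for x in v))
            return
        x = order[idx]
        # Image length a contributes a * multiplicity letters to phi(V).
        for a in range(1, remaining // mult[x] + 1):
            for img in product(alphabet, repeat=a):
                phi[x] = "".join(img)
                assign(idx + 1, remaining - a * mult[x], phi)

    assign(0, n, {})
    return found


def lemma13_bound(v: str, n: int, q: int) -> float:
    """C(n/d + k + 1, k + 1) * q^(n(1-r)/r) for the multiplicities of V."""
    mult = Counter(v).values()
    k, r, d = len(mult), min(mult), reduce(gcd, mult)
    return float(gen_binom(Fraction(n, d) + k + 1, k + 1)) * q ** (n * (1 - r) / r)
```

For $V = xx$ on a binary alphabet, the instances of length 4 are the four squares $AA$ with $|A| = 2$, so $\mathbb{I}_4 = 1/4$, comfortably below the bound $\binom{4}{2} \cdot 2^{-2} = 1.5$.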


We obtain nontrivial concentration around the mean using covariance and the fact that most 
“short” substrings in a word do not overlap. 

Theorem 14. Let $V$ be a doubled word with $k$ distinct letters, $\Sigma$ an alphabet with $q \ge 2$ letters, and $W_n \in \Sigma^n$ chosen uniformly at random. Then

\[
\mathrm{Var}(\delta(V, W_n)) = O\!\left(\frac{(\log n)^3}{n^3}\right).
\]

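Proof. Let $X_n = \binom{n+1}{2}\delta(V, W_n)$ be the random variable counting the number of substrings of $W_n$ that are $V$-instances. For fixed $n$, let $X_{a,b}$ be the indicator variable for the event that $W_n[a, b]$ is a $V$-instance, so $X_n = \sum_{a=0}^{n-1} \sum_{b=a+1}^{n} X_{a,b}$. Let $(a, b) \sim (c, d)$ denote that $[a, b]$ and $[c, d]$ overlap. Note that

\[
\begin{aligned}
\mathrm{Cov}(X_{a,b}, X_{c,d}) &\le \mathbb{E}(X_{a,b} X_{c,d}) \\
&\le \min(\mathbb{E}(X_{a,b}), \mathbb{E}(X_{c,d})) \\
&= \min(\mathbb{I}_{(b-a)}(V, \Sigma), \mathbb{I}_{(d-c)}(V, \Sigma)) \\
&\le \binom{i/d + k + 1}{k + 1}\, q^{i(1-r)/r}
\end{aligned}
\]

for $i \in \{b - a, d - c\}$, where $r \ge 2$ is the minimum multiplicity of a letter in $V$ and the $d$ inside the binomial coefficient is the gcd of the multiplicities, as in Lemma 13. For $i < n/3$, the number of intervals in $W_n$ of length at most $i$ that overlap a fixed interval of length $i$ is less than $\binom{3i}{2}$. Define the following function on $n$, which acts as a threshold for "short" substrings of a random length-$n$ word:

\[
s(n) = -2 \log_q\!\left(n^{-(k+5)}\right) = t \log n,
\]

where $t = \frac{2(k+5)}{\log q} > 0$. Intervals that do not overlap give independent indicators and so contribute zero covariance. For sufficiently large $n$,

\[
\begin{aligned}
\mathrm{Var}(X_n) &= \sum_{\substack{0 \le a < b \le n \\ 0 \le c < d \le n}} \mathrm{Cov}(X_{a,b}, X_{c,d}) \\
&\le \sum_{(a,b) \sim (c,d)} \min(\mathbb{I}_{(b-a)}(V, \Sigma), \mathbb{I}_{(d-c)}(V, \Sigma)) \\
&= \left[ \sum_{\substack{(a,b) \sim (c,d) \\ b-a,\, d-c \le s(n)}} + \sum_{\substack{(a,b) \sim (c,d) \\ \text{else}}} \right] \min(\mathbb{I}_{(b-a)}(V, \Sigma), \mathbb{I}_{(d-c)}(V, \Sigma)) \\
&< 2 \sum_{i=1}^{\lfloor s(n) \rfloor} (n + 1 - i) \binom{3i}{2} + 2 \sum_{i=\lceil s(n) \rceil}^{n} (n + 1 - i) \binom{n+1}{2} \binom{i/d + k + 1}{k + 1}\, q^{i(1-r)/r} \\
&< 2\, s(n)\, n\, (3 s(n))^2 + n^{5+k}\, q^{\log_q\left(n^{-(k+5)}\right)} \\
&= 18\, (t \log n)^3\, n + 1 \\
&= O(n (\log n)^3).
\end{aligned}
\]

Since $\mathbb{E}(\delta(V, W_n)) = \Omega(n^{-1})$ by Proposition 12,

\[
\mathrm{Var}(\delta(V, W_n)) = \mathrm{Var}\!\left(\frac{X_n}{\binom{n+1}{2}}\right) = \frac{\mathrm{Var}(X_n)}{\binom{n+1}{2}^2} = O\!\left(\frac{(\log n)^3}{n^3}\right) = O\!\left(\mathbb{E}(\delta(V, W_n))^3 (\log n)^3\right).
\]

□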


Lemma 15. Let $V$ be a word with $k$ distinct letters, each occurring at least $r \in \mathbb{Z}^+$ times. Let $\Sigma$ be a $q$-letter alphabet and $W_n \in \Sigma^n$ chosen uniformly at random. Recall that $\binom{n+1}{2}\delta(V, W_n)$ is the number of substrings of $W_n$ that are $V$-instances. Then for any nondecreasing function $f(n) > 0$,

\[
\mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) > n \cdot f(n) \right) < n^{k+3}\, q^{f(n)(1-r)/r}.
\]

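Proof. Lemma 13 gives a bound on the probability that randomly chosen $W_n \in \Sigma^n$ is a $V$-instance:

\[
\mathbb{P}(\delta_{\mathrm{sur}}(V, W_n) = 1) = \mathbb{I}_n(V, \Sigma) \le \binom{n/d + k + 1}{k + 1}\, q^{n(1-r)/r}.
\]

Since $\delta_{\mathrm{sur}}(V, W) \in \{0, 1\}$,

\[
\sum_{m=1}^{\lfloor f(n) \rfloor} \sum_{\ell=0}^{n-m} \delta_{\mathrm{sur}}(V, W_n[\ell, \ell + m]) \le n \cdot f(n).
\]

Therefore,

\[
\begin{aligned}
\mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) > n \cdot f(n) \right) &= \mathbb{P}\!\left( \sum_{m=1}^{n} \sum_{\ell=0}^{n-m} \delta_{\mathrm{sur}}(V, W_n[\ell, \ell + m]) > n \cdot f(n) \right) \\
&\le \mathbb{P}\!\left( \sum_{m=\lceil f(n) \rceil}^{n} \sum_{\ell=0}^{n-m} \delta_{\mathrm{sur}}(V, W_n[\ell, \ell + m]) > 0 \right) \\
&\le \sum_{m=\lceil f(n) \rceil}^{n} \sum_{\ell=0}^{n-m} \mathbb{P}(\delta_{\mathrm{sur}}(V, W_n[\ell, \ell + m]) > 0) \\
&= \sum_{m=\lceil f(n) \rceil}^{n} (n - m + 1)\, \mathbb{P}(\delta_{\mathrm{sur}}(V, W_m) = 1) \\
&\le \sum_{m=\lceil f(n) \rceil}^{n} (n - m + 1) \binom{m/d + k + 1}{k + 1}\, q^{m(1-r)/r} \\
&< n^{k+3}\, q^{f(n)(1-r)/r}.
\end{aligned}
\]

□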


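Theorem 16. Let $V$ be a doubled word, $\Sigma$ an alphabet with $q \ge 2$ letters, and $W_n \in \Sigma^n$ chosen uniformly at random. Then the $p$-th raw moment and the $p$-th central moment of $\delta(V, W_n)$ are both $O((\log(n)/n)^p)$.

Proof. Let us use Lemma 15 to first bound the $p$-th raw moments for $\delta(V, W_n)$, assuming $r \ge 2$. To minimize our bound, generalize the threshold function from Theorem 14:

\[
s_p(n) = \frac{r}{1 - r} \log_q\!\left(n^{-(k+5+p)}\right) = t_p \log n,
\]

where $t_p = \frac{r(k+5+p)}{(r-1)\log q} > 0$.

\[
\begin{aligned}
\mathbb{E}(\delta(V, W_n)^p) &= \sum_{i=0}^{\binom{n+1}{2}} \mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) = i \right) \left( \frac{i}{\binom{n+1}{2}} \right)^p \\
&< \sum_{i=0}^{\lfloor n \cdot s_p(n) \rfloor} \mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) = i \right) \left( \frac{i}{\binom{n+1}{2}} \right)^p + \sum_{i=\lceil n \cdot s_p(n) \rceil}^{\binom{n+1}{2}} n^{k+3}\, q^{s_p(n)(1-r)/r} \left( \frac{i}{\binom{n+1}{2}} \right)^p \\
&< \left( \frac{n \cdot s_p(n)}{\binom{n+1}{2}} \right)^p + n^{k+5}\, q^{s_p(n)(1-r)/r} \\
&= \left( \frac{n\, t_p \log n}{\binom{n+1}{2}} \right)^p + n^{k+5}\, n^{-(k+5+p)} \\
&= O_p\!\left( \left( \frac{\log n}{n} \right)^p \right).
\end{aligned}
\]

Setting $p = 1$, there exists some $c > 2$ such that $E_n = \mathbb{E}(\delta(V, W_n)) < (c \log n)/n$. We use this upper bound on the expectation (1st raw moment) to bound the central moments. Since $\delta(V, W_n)$ and $E_n$ both lie in $[0, 1]$, each term $|i/\binom{n+1}{2} - E_n|$ is at most $\max(i/\binom{n+1}{2}, E_n) \le 1$.

\[
\begin{aligned}
\mathbb{E}(|\delta(V, W_n) - E_n|^p) &= \sum_{i=0}^{\binom{n+1}{2}} \mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) = i \right) \left| \frac{i}{\binom{n+1}{2}} - E_n \right|^p \\
&\le \sum_{i=0}^{\lfloor n \cdot s_p(n) \rfloor} \mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) = i \right) \max\!\left( \frac{n \cdot s_p(n)}{\binom{n+1}{2}}, \frac{c \log n}{n} \right)^p + \sum_{i=\lceil n \cdot s_p(n) \rceil}^{\binom{n+1}{2}} \mathbb{P}\!\left( \binom{n+1}{2}\delta(V, W_n) = i \right) \\
&< \left( \frac{\max(2 t_p, c) \log n}{n} \right)^p + n^{k+5}\, q^{s_p(n)(1-r)/r} \\
&= O_p\!\left( \left( \frac{\log n}{n} \right)^p \right).
\end{aligned}
\]

□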


Question 17. For nondoubled word $V$, to what extent is the density of $V$ in random words concentrated about its mean?